Skip to content

LCORE-1518: Skip model authentication against Azure#1766

Open
asimurka wants to merge 2 commits into
lightspeed-core:mainfrom
asimurka:azure_skip_model_availability
Open

LCORE-1518: Skip model authentication against Azure#1766
asimurka wants to merge 2 commits into
lightspeed-core:mainfrom
asimurka:azure_skip_model_availability

Conversation

@asimurka
Copy link
Copy Markdown
Contributor

@asimurka asimurka commented May 19, 2026

Description

Optimizes Azure Entra ID token handling by skipping startup authentication for Azure models. This allows to deffer token acquisition to inference requests allowing to remove prestart token acquisition and passing by environment variables.

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Konflux configuration change
  • Unit tests improvement
  • Integration tests improvement
  • End to end tests improvement
  • Benchmarks improvement

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

  • Assisted-by: Cursor

Related Tickets & Documents

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  • Please provide detailed steps to perform tests related to this code change.
  • How were the fix/results from this change verified? Please provide relevant screenshots or results.

Summary by CodeRabbit

  • Refactor

    • Azure authentication now obtains and supplies tokens at runtime via secure per-request headers; tokens are cached in memory rather than injected from environment files.
    • Configuration enrichment disables Azure model validation automatically and no longer requires container .env files at startup.
  • Chores

    • Startup/run process updated to generate and use an enriched configuration before launching the service.
  • Documentation

    • Updated Azure Entra ID authentication guide and examples to reflect deferred auth and new config fields.

Review Change Stack

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 19, 2026

Caution

Review failed

Failed to post review comments

Walkthrough

This PR moves Azure Entra ID token handling from startup-time .env generation to in-memory caching and per-request header injection. It adds client-holder methods to refresh and apply Azure provider data, enriches configs to disable Azure model validation at startup, updates endpoints/startup/scripts/tests, and revises docs/examples.

Changes

Azure Token Handling Refactoring: In-Memory Caching & Runtime Acquisition

Layer / File(s) Summary
Token Manager In-Memory Storage
src/authorization/azure_token_manager.py, tests/unit/authorization/test_azure_token_manager.py
AzureEntraIDManager now caches _access_token and _azure_base_url in memory, exposes access_token/azure_base_url properties, set_base_url() and build_azure_provider_data(). _update_access_token() no longer writes to environment variables; tests updated for in-memory semantics.
Configuration Enrichment with Model Validation
src/llama_stack_configuration.py, tests/unit/test_llama_stack_configuration.py
Adds enrich_azure_entra_id_inference() to set model_validation: false for remote::azure when azure_entra_id is configured; removes token-generation and --env-file handling. generate_configuration() signature updated to drop env_file.
Client Holder Token Update & Azure Base URL Query
src/client.py, tests/unit/test_client.py
Adds AsyncLlamaStackClientHolder.update_azure_token() (async) to obtain Azure provider data and rebuild or update the held client (library vs service modes). Adds get_azure_base_url() and integrates enrich_azure_entra_id_inference() into library-mode enrichment; old update_provider_data() removed. Tests cover header injection and deferral behavior.
Endpoint Handler Token Refresh Integration
src/app/endpoints/query.py, src/app/endpoints/responses.py, src/app/endpoints/streaming_query.py, src/app/endpoints/rlsapi_v1.py, src/utils/query.py, tests/unit/app/endpoints/*, tests/unit/utils/test_query.py
Endpoint handlers now refresh Azure tokens by calling AsyncLlamaStackClientHolder().update_azure_token(); the standalone update_azure_token() helper was removed from utils/query.py. Endpoint tests were updated to mock the client holder method.
Startup & Runtime Integration
src/app/main.py, scripts/llama-stack-entrypoint.sh, tests/e2e-prow/.../llama-stack-openai.yaml, tests/e2e-prow/.../llama-stack-prow.yaml
App lifespan no longer performs immediate startup token minting; it initializes the client, then configures AzureEntraIDManager and sets its base URL using get_azure_base_url(). Entrypoint scripts and manifests stop sourcing/passing a .env file during enrichment.
Documentation & Configuration Examples
docs/providers.md, docs/rag_guide.md, examples/azure-run.yaml, tests/e2e/configs/run-azure.yaml
Docs updated to describe deferred inference-time Entra ID auth via X-LlamaStack-Provider-Data, in-memory token caching, and removal of env-file usage; example configs drop api_key and add model_validation: false.
Makefile Simplification
Makefile
run-llama-stack target now generates an enriched config via src/llama_stack_configuration.py before running uv run llama stack run with the enriched config.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • tisnik
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title 'Skip model authentication against Azure' directly and specifically describes the main objective of the changeset: deferring Azure Entra ID token acquisition from startup to inference time by setting model_validation to false for Azure providers.
Docstring Coverage ✅ Passed Docstring coverage is 97.37% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
✨ Simplify code
  • Create PR with simplified code

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/client.py (1)

257-263: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Harden provider-data JSON parsing before merge.

At Line 258, valid JSON that is not an object (for example, a list) will pass decoding and then fail at Line 262 on .update(...).

Proposed fix
-        try:
-            provider_data = json.loads(provider_data_json) if provider_data_json else {}
-        except (json.JSONDecodeError, TypeError):
-            provider_data = {}
+        try:
+            parsed = json.loads(provider_data_json) if provider_data_json else {}
+            provider_data = parsed if isinstance(parsed, dict) else {}
+        except (json.JSONDecodeError, TypeError):
+            provider_data = {}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/client.py` around lines 257 - 263, The JSON parsing for provider data
currently assumes json.loads(provider_data_json) yields a dict and then calls
provider_data.update(updates); to fix, after decoding (and in the existing
except block) ensure the decoded value is a dict (use isinstance(provider_data,
dict)) and if it's not, replace it with an empty dict before calling
provider_data.update(updates); keep the existing exception handling for
JSONDecodeError/TypeError and apply this check where provider_data and
provider_data_json are used.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/providers.md`:
- Around line 95-107: The docs use two different config keys—`base_url` in the
prose and `api_base` in the YAML—causing inconsistency with the actual schema;
pick the correct schema key (whichever the code expects for the `remote::azure`
provider) and update both the descriptive text and the example so they match
(for example, replace `api_base` with `base_url` or vice versa) and keep
references to `azure_entra_id`, `model_validation: false`, `provider_id: azure`,
and `provider_type: remote::azure` unchanged.
- Around line 114-118: Update the lifecycle text under "Lightspeed startup
(library and service mode)" to remove the claim that Lightspeed "Acquires an
initial access token from Microsoft Entra ID" and "Caches the token in memory
(`SecretStr`, per worker)" at startup; instead state that Lightspeed reads Entra
ID configuration at startup but defers authentication until request time,
acquiring and caching access tokens on demand per-request (and still
initializing the Llama Stack client with `X-LlamaStack-Provider-Data` as
before). Ensure the revised step explicitly reflects the runtime-only,
deferred-auth flow implemented in the PR.

In `@src/client.py`:
- Around line 245-250: When initializing the AsyncLlamaStackAsLibraryClient
inside the token-refresh path, wrap the await client.initialize() call in a
try/except that catches initialization failures (broad Exception or the specific
initialization error type) and translate them into a service-unavailable error
(e.g., raise or return a ServiceUnavailableError / HTTP 503) instead of letting
the exception bubble; set self._lsc only after successful initialization (leave
self._lsc unchanged on failure) and include the client
variable/AsyncLlamaStackAsLibraryClient.initialize in your handling so callers
receive a 503-style backend-unavailable response when initialization fails.

In `@tests/unit/app/endpoints/test_streaming_query.py`:
- Line 731: The test creates an unused mock "mock_updated_client" and sets
update_azure_token to return it but never asserts it's used; either remove the
unused mock_updated_client and its return configuration to simplify the test, or
if you intend to verify the updated client is applied, change the assertions to
check that get_client() (or subsequent operations that use the client)
returns/uses mock_updated_client after calling update_azure_token; update
references to AsyncLlamaStackClient, mock_updated_client, and update_azure_token
accordingly.

---

Outside diff comments:
In `@src/client.py`:
- Around line 257-263: The JSON parsing for provider data currently assumes
json.loads(provider_data_json) yields a dict and then calls
provider_data.update(updates); to fix, after decoding (and in the existing
except block) ensure the decoded value is a dict (use isinstance(provider_data,
dict)) and if it's not, replace it with an empty dict before calling
provider_data.update(updates); keep the existing exception handling for
JSONDecodeError/TypeError and apply this check where provider_data and
provider_data_json are used.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 23accaba-033e-4848-bb78-61bb7f096894

📥 Commits

Reviewing files that changed from the base of the PR and between 559753d and 3324de6.

📒 Files selected for processing (22)
  • Makefile
  • docs/providers.md
  • docs/rag_guide.md
  • examples/azure-run.yaml
  • scripts/llama-stack-entrypoint.sh
  • src/app/endpoints/query.py
  • src/app/endpoints/responses.py
  • src/app/endpoints/streaming_query.py
  • src/app/main.py
  • src/authorization/azure_token_manager.py
  • src/client.py
  • src/llama_stack_configuration.py
  • src/utils/query.py
  • tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack-openai.yaml
  • tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack-prow.yaml
  • tests/unit/app/endpoints/test_query.py
  • tests/unit/app/endpoints/test_responses.py
  • tests/unit/app/endpoints/test_streaming_query.py
  • tests/unit/authorization/test_azure_token_manager.py
  • tests/unit/test_client.py
  • tests/unit/test_llama_stack_configuration.py
  • tests/unit/utils/test_query.py
💤 Files with no reviewable changes (4)
  • Makefile
  • tests/unit/utils/test_query.py
  • docs/rag_guide.md
  • src/utils/query.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: E2E: library mode / ci / group 2
  • GitHub Check: E2E: library mode / ci / group 3
  • GitHub Check: E2E: server mode / ci / group 1
  • GitHub Check: E2E Tests for Lightspeed Evaluation job
  • GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
🧰 Additional context used
📓 Path-based instructions (3)
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

tests/**/*.py: Use pytest for all unit and integration tests; do not use unittest
Use pytest.mark.asyncio marker for async tests

Files:

  • tests/unit/app/endpoints/test_query.py
  • tests/unit/app/endpoints/test_responses.py
  • tests/unit/test_llama_stack_configuration.py
  • tests/unit/app/endpoints/test_streaming_query.py
  • tests/unit/authorization/test_azure_token_manager.py
  • tests/unit/test_client.py
src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.py: Use absolute imports for internal modules: from authentication import get_auth_dependency
Llama Stack imports: Use from llama_stack_client import AsyncLlamaStackClient
Check constants.py for shared constants before defining new ones
All modules must start with descriptive docstrings explaining purpose
Use logger = get_logger(__name__) from log.py for module logging
All functions must have complete type annotations for parameters and return types, use modern syntax (str | int), and include descriptive docstrings
Use snake_case with descriptive, action-oriented names for functions (get_, validate_, check_)
Avoid in-place parameter modification anti-patterns; return new data structures instead of modifying function parameters
Use async def for I/O operations and external API calls
Use standard log levels with clear purposes: debug() for diagnostic info, info() for program execution, warning() for unexpected events, error() for serious problems
All classes must have descriptive docstrings explaining purpose and use PascalCase with standard suffixes: Configuration, Error/Exception, Resolver, Interface
Abstract classes must use ABC with @abstractmethod decorators
Follow Google Python docstring conventions with required sections: Parameters, Returns, Raises, and Attributes for classes

Files:

  • src/app/endpoints/streaming_query.py
  • src/app/endpoints/responses.py
  • src/app/main.py
  • src/authorization/azure_token_manager.py
  • src/llama_stack_configuration.py
  • src/client.py
  • src/app/endpoints/query.py
src/app/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/app/**/*.py: FastAPI dependencies: Import from fastapi module for APIRouter, HTTPException, Request, status, Depends
Use FastAPI HTTPException with appropriate status codes for API endpoints and handle APIConnectionError from Llama Stack

Files:

  • src/app/endpoints/streaming_query.py
  • src/app/endpoints/responses.py
  • src/app/main.py
  • src/app/endpoints/query.py
🧠 Learnings (4)
📚 Learning: 2026-05-12T15:14:34.788Z
Learnt from: syedriko
Repo: lightspeed-core/lightspeed-stack PR: 1727
File: scripts/konflux_requirements.sh:9-15
Timestamp: 2026-05-12T15:14:34.788Z
Learning: In this repo, the `.konflux/` directory is committed/tracked and is guaranteed to exist in a fresh clone. Therefore, shell scripts that write output under `.konflux/` (e.g., create files like `.konflux/<...>`) should not waste effort by calling `mkdir -p .konflux` first. Only add directory-creation logic if the script may run in an environment/repo state where `.konflux/` might not be present.

Applied to files:

  • scripts/llama-stack-entrypoint.sh
📚 Learning: 2026-02-19T10:06:50.647Z
Learnt from: radofuchs
Repo: lightspeed-core/lightspeed-stack PR: 1181
File: tests/e2e-prow/rhoai/manifests/lightspeed/mock-jwks.yaml:32-34
Timestamp: 2026-02-19T10:06:50.647Z
Learning: In the rhoai tests under tests/e2e-prow/rhoai/manifests, avoid static ConfigMap definitions for mock-jwks-script and mcp-mock-server-script since these ConfigMaps are created dynamically by the pipeline.sh deployment script using 'oc create configmap'. Ensure there are no static ConfigMap resources for these names in the manifests. If such ConfigMaps are added in the future, coordinate with the pipeline to reflect dynamic creation or adjust tests to rely on the dynamic provisioning.

Applied to files:

  • tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack-openai.yaml
  • tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack-prow.yaml
📚 Learning: 2026-04-06T20:18:07.852Z
Learnt from: major
Repo: lightspeed-core/lightspeed-stack PR: 1463
File: src/app/endpoints/rlsapi_v1.py:266-271
Timestamp: 2026-04-06T20:18:07.852Z
Learning: In the lightspeed-stack codebase, within `src/app/endpoints/` inference/MCP endpoints, treat `tools: Optional[list[Any]]` in MCP tool definitions as an intentional, consistent typing pattern (used across `query`, `responses`, `streaming_query`, `rlsapi_v1`). Do not raise or suggest this as a typing issue during code review; changing it in isolation could break endpoint typing consistency across the codebase.

Applied to files:

  • src/app/endpoints/streaming_query.py
  • src/app/endpoints/responses.py
  • src/app/endpoints/query.py
📚 Learning: 2026-01-14T09:37:51.612Z
Learnt from: asimurka
Repo: lightspeed-core/lightspeed-stack PR: 988
File: src/app/endpoints/query.py:319-339
Timestamp: 2026-01-14T09:37:51.612Z
Learning: In the lightspeed-stack repository, when provider_id == "azure", the Azure provider with provider_type "remote::azure" is guaranteed to be present in the providers list. Therefore, avoid defensive StopIteration handling for next() when locating the Azure provider in providers within src/app/endpoints/query.py. This change applies specifically to this file (or nearby provider lookup code) and relies on the invariant that the Azure provider exists; if the invariant could be violated, keep the existing StopIteration handling.

Applied to files:

  • src/app/endpoints/query.py
🔇 Additional comments (17)
examples/azure-run.yaml (1)

27-27: LGTM!

src/app/main.py (1)

98-102: LGTM!

scripts/llama-stack-entrypoint.sh (1)

18-18: LGTM!

tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack-openai.yaml (1)

163-163: LGTM!

tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack-prow.yaml (1)

153-153: LGTM!

src/authorization/azure_token_manager.py (2)

36-42: LGTM!

Also applies to: 59-78


99-106: LGTM!

tests/unit/authorization/test_azure_token_manager.py (1)

56-61: LGTM!

Also applies to: 72-78, 80-92, 163-179

src/llama_stack_configuration.py (2)

46-78: LGTM!


582-583: LGTM!

Also applies to: 632-632

tests/unit/test_llama_stack_configuration.py (1)

15-15: LGTM!

Also applies to: 28-104, 527-527

tests/unit/test_client.py (1)

6-6: LGTM!

Also applies to: 13-13, 16-19, 102-178

src/app/endpoints/query.py (1)

199-207: LGTM!

src/app/endpoints/responses.py (1)

400-408: LGTM!

src/app/endpoints/streaming_query.py (1)

257-265: LGTM!

tests/unit/app/endpoints/test_query.py (1)

544-546: LGTM!

Also applies to: 568-568

tests/unit/app/endpoints/test_responses.py (1)

469-470: LGTM!

Also applies to: 485-485, 505-505

Comment thread docs/providers.md
Comment thread docs/providers.md Outdated
Comment thread src/client.py
Comment thread tests/unit/app/endpoints/test_streaming_query.py
@asimurka asimurka force-pushed the azure_skip_model_availability branch from 3324de6 to ba728c8 Compare May 19, 2026 08:16
Comment thread docs/providers.md
api_key: ${env.AZURE_API_KEY} # Must be exactly this - placeholder for Entra ID token
api_base: ${env.AZURE_API_BASE}
# api_key: ${env.AZURE_API_KEY} # Can be omitted when Entra ID configured in LCORE
base_url: ${env.AZURE_API_BASE}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change in attribute name

# shellcheck source=/dev/null
set -a && . "$ENV_FILE" && set +a
fi
-o "$ENRICHED_CONFIG" 2>&1 || ENRICHMENT_FAILED=1
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The API key is not passed by .env file anymore

Comment thread src/app/main.py
if azure_entra_id_config is not None:
AzureEntraIDManager().set_config(azure_entra_id_config)
azure_base_url = await AsyncLlamaStackClientHolder().get_azure_base_url()
AzureEntraIDManager().set_base_url(azure_base_url)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In FastAPI lifespan just setup manager's attributes and defer the token acquisition to inference requests.

token = self.access_token.get_secret_value()
if not token or self.azure_base_url is None:
return None
return {"azure_api_key": token, "azure_api_base": self.azure_base_url}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Config and validator attribute discrepancy in LLS (api_base vs base_url)

Comment thread src/utils/query.py
return _is_inout_shield(shield) or not is_output_shield(shield)


async def update_azure_token(
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replaced to client holder

@asimurka asimurka requested review from are-ces and tisnik May 21, 2026 06:21
@asimurka asimurka force-pushed the azure_skip_model_availability branch from ba728c8 to daf6dd3 Compare May 21, 2026 06:45
Copy link
Copy Markdown
Contributor

@are-ces are-ces left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just run e2e tests for Azure first TY

@asimurka
Copy link
Copy Markdown
Contributor Author

Fixed 2 issues:

  • edited azure ci config: removed api key, added model_validation: false flag
  • added token refresh logic into rlsapi (the new logic revealed the fact that the logic was missing before)

CI running on my fork, see: https://github.com/asimurka/lightspeed-stack/actions/runs/26219639186/job/77150963063

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants